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Parallelism Of Tight-Binding Molecular Dynamics Simulations Is Presented By Means Of 
The Order-N Electronic Structure Theory With The Wannier States, Recently Developed (J. 
Phys. Soc. Jpn. 69,3773 (2000)). An Application Is Tested For Silicon Nanocrystals Of More 
Than Millions Atoms With The Transferable Tight-Binding Hamiltonian. The Efficiency Of 
Parallelism Is Perfect, 98.8%, And The Method Is The Most Suitable To Parallel Computation. 
The Elapse Time For A System Of 2 x 10 6 Atoms Is 3.0 Minutes By A Computer System Of 
64 Processors Of Sgi Origin 3800. The Calculated Results Are In Good Agreement With The 
Results Of The Exact Diagonalization, With An Error Of 2% For The Lattice Constant And 
Errors Less Than 10% For Elastic Constants. 
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1. Introduction 

Accurate large-scale atomistic simulations are very im- 
portant to investigate and to predict various properties 
of materials. For this purpose, the first principle elec- 
tronic structure theories have been extended to calcula- 
tions of the total energy and forces and the first prin- 
ciple molecular dynamics (MD) simulation or the Car- 
Parrinello method 1 ) are now used quite widely in the 
condensed matter physics. However, the systems for the 
first principle MD simulations are practically limited to 
much smaller size, at most, of hundreds atoms and much 
shorter time period of few tens pico-seconds. The other 
extreme is the classical MD simulations with short-range 
interatomic potentials which are applied to systems of 
millions or ten millions atoms with time period of a few 
hundreds pico-seconds. 2, 3 ) Classical MD simulations are 
very useful to investigate nanoscale systems when accu- 
rate interatomic potentials can be used. Even so, appli- 
cability of classical MD simulations is limited to phenom- 
ena in which electronic process does not play an essential 
role. 

Modern material technology is deeply involved in elec- 
tronic processes. Then intense attention has been paid 
to the order-N method for the electronic structure cal- 
culations, whose computational cost increases in linearly 
proportion to the number of electrons. 4, 5 -* 

Novel order-N method is being developed on the basis 
of the Wannier states. 6 ' 7 - > The Wannier states is formally 
defined with the unitary transformation of the occupied 
eigen states. Once we get true Wannier states the 
density matrix can be defined as 



as 



(i) 



The expectation value of any physical quantity X can be 
obtained with the density matrix or the Wannier states 



<A > )=Tr[pX]=^(^|A > |^). 



(2) 



If we put the localization constraint to construct approx- 
imate Wannier states with a loss of certain amount of ac- 
curacy, then we can formulate the order-N method and 
reduce computational cost extremely. Using this order-N 
method, a system of 1.4 x 10 6 atoms was calculated by 
a single CPU standard workstation. 8 ' 

In the present paper we do parallel computation of the 
perturbation procedure of the order-N method, we call 
the perturbative order-N method. Silicon nanocrystals 
are calculated up to a system of 2,097,152 atoms, using 
SGI Origin 3800 system, and the efficiency of parallelism 
is analyzed. To test an accuracy and applicability for cal- 
culation of physical quantities, the lattice constant and 
elastic constants are calculated using cluster systems of 
up to 1,423,909 atoms. The usefulness and the limit of 
the perturbative order-N method will be discussed in de- 
tail. 

2. Theoretical backgrounds 

2.1 The Wannier states 

The Wannier states centered on the j-bond can be 
expressed as 



(3) 



where is the mixing coefficient of the central bond- 
ing orbital \ bj) and Cj is that of the anti-bonding orbital 
|di) on the neighboring i-bond. 6,7 ' The mixing of the 
bonding orbitals on the neighboring bonds are negligibly 
small due to the orthogonality and the completeness, be- 
cause they contribute to other Wannier states. 

For diamond structure crystals, we adopt the trans- 
ferable Hamiltonian H of Kwon et al. 9 ^ The Hamil- 
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tonian includes the tight-binding interactions and the 
short-range repulsive interactions between ion cores. We 
truncate the hopping interactions between the first and 
the second neighbor distances. If we denote sp hy- 
bridized orbitals the bonding orbital \bj) and the 
anti-bonding orbital \aj) are linear combinations of the 
two hybridized orbitals ± \hii})/y/2. 

In the case of silicon crystals, the exact results for a 



system of 512 atoms are |C 



(0)|2 



0.938 and £\ \C 



up to the second bond-steps is 0.995. On the other hand, 
by the first-order perturbation theory the coefficients can 
be given by an equation 



Cf("M) (a^\H\b k ) 



C(°) 

and this gives |cj°^| 2 



(4) 



£b - £a 

0.934. 6 - 7 ) Note that, the first- 

W|2 



order perturbation theory gives the value 1 for J^i |C 
up to the second bond-steps. 

2.2 Perturbative order-N method 

The total energy in the tight-binding formalism is 
given as 



-Etot — Ebs + E n 



(5) 



where E^ s is the band structure (BS) energy and E lcp is 
the repulsive energy. On the basis of the Wannier states 
\ipj), the BS energy and its contribution to forces on 
the I atom (site R/) are written, with the tight- binding 
Hamiltonian H, as 6 - 1 



and 



occ 

E 

3 



(6) 



(7) 



We will calculate the Wannier states by using the per- 
turbative treatment Eqs. (3) and (4) and the density 
matrix should be given in the same equation as Eq. (1) 
with calculated Wannier states \ipj). The computational 
cost of the procedure is linearly scales by the number of 
electrons TV, 6 ' 7 ) and this procedure we call the pertur- 
bative order-N method. 

When we use the variational procedure to obtain the 
Wannier states, the physical quantities should be cal- 
culated in a way consistent with the calculation of the 
Wannier states. 6 ' 7 - 1 Therefore, the density matrix in the 
above Eqs. (6) and (7) should be replaced by the op- 
timal one p = 2p — p 2 , and this procedure we call the 
variational order-N method. 

2.3 Linearly scaling property of perturbative order-N 
method 

The Wannier states and the matrix elements of the 
Hamiltonian are given on the basis of the atomic orbitals 
\4>i a ). Then we can estimate the computational cost in 
the following way. Firstly the above physical quantities 
can be rewritten, by using the matrix elements of the 



Hamiltonian and the density matrix represented by the 
atomic orbitals, as 



Ehs - E EEE^c+aw 



x {<j>i a \H\<i> {I+A)l) ), (8) 



(*)i2 and 



N, lom JV loc N u N v 

E EE E PJ°V+*)P 
J A a f3 

X ((pJa\^^\4>(J+A)p) ■ 



(9) 



The numbers N at om, and N v are those of atoms and 
atomic orbitals per atom, respectively. The number iVi oc 
is that of interacting atoms in the local region around 
the central atom. The local region is defined, outside 
which the matrix elements of the tight-binding Hamilto- 
nian vanish. In the diamond structure, N\ oc = 17 includ- 
ing the central, the first and the second neighbor atoms. 

From Eqs. (8) and (9), the total computation time of 
the matrix elements are scaled by a factor N v x N v x 
N\ oc x N atom , where the factor N v x N v is due to the 
cost for the quantum mechanical calculation. In the sp 3 
minimal basis set, N v is four. Therefore, we can calculate 
each Wannier state by a local procedure and the total 
computation cost is proportional to the number 7V at0 m- 
The procedure is then the perfect order-N method. 

Non-negligible amount of computation time is con- 
sumed in the calculation of the repulsive energy and 
forces. The part of the listing of the neighboring atoms is 
also important and its cost is not negligible for large sys- 
tems. The above two parts are the same as in standard 
classical MD simulations and the computation time of 
these two parts are scaled by a factor iVi oc x -/V at0 m- The 
computation time of the BS energy and forces costs, at 
least, N v x N v times more than those of the calculation of 
repulsive interactions and the listing of the neighboring 
atoms. 

2.4 Allotment of Wannier states to processors, memory 
size and communication of data 

Since the perturbative treatment is completely inde- 
pendent among the Wannier states, we can parallelize the 
computation with respect to several groups of the Wan- 
nier states. When we use iVcpu processors, each pro- 
cessor participates in the calculation of about N/Ncptj 
states among the total N Wannier states. For exam- 
ple, the calculation of { (ijjj\H\ipj} and {^j\-§^\ipj) 
}j=j n _i+i,-,j„ , {in - jn-i ^ N/Ncpv) is allotted to 
the n-th processor. 

The matrix elements are not stored on memory be- 
cause they require totally a large CPU memory. For 
example, the total memory size for the matrix elements 
{((l>ia\H\<pu + A)0)} can be estimated to be 8 (B) x 4 2 x 
17 x 10 6 = 2.2 (GB) for a system of 10 6 atoms. Therefore, 
we calculate the matrix elements when they are required 
and do not store them. 
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Fig. 1. The elapse time of one MD loop by using one processor for 
the systems of 4,096, 32,768, 262,144 and 2,097,152 atoms of Si 
crystals. 




8 16 32 64 
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Fig. 2. The speed-up ratio a p for systems of 262,144 and 2,097,152 
atoms as a function of the number of processors. That for a sys- 
tem of 32,768 atoms locates slightly above that of 262,144 atoms. 
The dashed lines fire th.6 meLxiniuni spesd-up ratio (dp )max of 
P = 0.988 and dotted lines are just (a p ) ma x = Ncpu cor- 
responding to P = 1. The observed deviation of the line of 
2,097,152 atoms at 16 processors is due to the consumption of 
the elapse time in the data communication, presumably because 
of an ill balancing of the data size and the number of processors. 



In the case of 10 6 atoms, the memory size for all atomic 
positions is 8 (B) x 3 x 10 6 = 24 (MB), that for the listing 
of the neighboring atoms is 4 (B) x 16 x 10 6 = 64 (MB), 
and that for the force is 8 (B) x 3 x 10 6 = 24 (MB). Then 
the total memory size for each processor is not large. In 
the present calculation, all processors have the same data 
of all atomic positions and no data are communicated 
among processors during the calculation of the Wannicr 
states and contributions to the BS energy and forces. 

After calculating contributions to the BS energy 
and forces in individual processor, we should sum up 
these elements. This procedure is accomplished by 
MPI_ALLREDUCE command in the Message Passing 
Interface (MPI) in the present calculation. This is waste- 
ful procedure with respect to the communication of the 
data because a large number of communicated data is 
not necessary, that is, null. Notwithstanding, the com- 
munication time would be relatively cheap expenses in 
the present work because there exist very heavy compu- 
tations with respect to quantum mechanical freedoms. 

3. Results and discussions 

3.1 Elapse time of one processor 

We calculated four different systems of 4,096, 32,768, 
262,144 and 2,097,152 atoms of Si crystals with the pe- 
riodic boundary condition. Figure 1 shows the elapse 
time of one MD loop and one can see the almost per- 
fect linearly scaling property. The elapse times for the 
calculation of the BS energy and forces are 97.44 %, 
97.65%, 97.63% and 97.64% of the total elapse times 
in respective systems, which are larger than a factor 
16/(16 + 1) = 0.941, estimated simply in the case of 
N v x N v — 16. Note that the small difference between 
two numbers 0.976 and 0.94 is very serious for the speed- 
up ratio discussed later. From these results, we can con- 
clude that the tight-binding calculation is heavier by a 



factor 2.5 (= . °n 7 JL c x 17 ., 16 ) than our naive estimation. 

v 1 — 0.976 16 ' 

This is very crucial difference in actual simulations, and 
quantum mechanical calculations are much heavier than 
the classical simulations even in the tight-binding cal- 
culation. One should parallelize firstly the part of the 
calculation of the BS energy and forces. In practice, we 
parallelize also the part of the calculation of the repul- 
sive interaction energy. The parallclizable fraction P, the 
fraction of the elapse time of strictly parallclizable part 
among the whole elapse time in the case of one processor, 
reaches to 0.988. This high value of P would be impossi- 
ble without the generic property of the linear scaling of 
the present method. Therefore, the pcrturbativc ordcr- 
N method is one of the most suitable procedures to the 
parallel computation. 

3. 2 Speed-up ratio by parallelism 

When we parallelize the computational program by 
using -/Vcpu processors, the speed-up ratio a p is defined 
as the ratio of the elapse time iAj- CPU of iVcpu processors 
and that t\ of one processor as 

a p ee . (10) 

r iVcpu 

Let us assume that we can parallelize the part of the 
fraction P perfectly. In other words, we assume that 
the elapse time of this part can be reduced by a fac- 
tor 1/iVcpu- I n such strictly parallelized case, the total 
elapse time can be minimized with the maximal speed-up 
ratio 10 ) 

K)max = ti x (— !— ) = - — p— . (11) 

V tjVcPu 1 max (1 - P) + K1 r 

v 7 JVCPU 

The high speed-up ratio is possible only for the high value 
of the parallclizable fraction P. 
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Fig. 3. The lattice constant (a) and elastic constants (b) as a function of N^ 1 ^ 3 , obtained by the perturbative method. 





Present work 


Diag. iJ > 


LDA VA > 


Exp. 


Lattice constant [A] 


5.546 (2.19%) 


5.427 


5.431 


5.429 L3 > 


Bulk modulus [GPa] 


82.3 (6.05%) 


87.6 


93.0 


97.8 14 ) 


Cu - C12 [GPa] 


92.3 (1.70%) 


93.9 


98.0 


101.2 14 ) 


C° 4 [GPa] 


188.3(5.14%) 


198.5 


111.0 




C 44 [GPa] 


97.9 (10.0%) 


89.0 


85.0 


79.6 14 ) 



Table I. Lattice constant and clastic constants extrapolated to N — » 00 systems are compared with those by the diagonalization method 
(Diag.), the first-principles calculation within the Local Density Approximation (LDA), and experiments (Exp.). The values in 
parentheses are errors against those with the diagonalization method. The elastic constant C 44 is that in a system where the internal 
displacement is not relaxed. 



Figure 2 shows the observed speed-up ratio a p by using 
MPI and the maximum one (a p ) ma x with P — 0.988 as a 
function of the number of processors. Compared with the 
minimum elapse time ti/(a p ) max , the present computa- 
tion consumes 2 % more for 32,768 atoms (8 processors), 
18 % more for 262,144 atoms (64 processors), and 45 % 
more for 2,097,152 atoms (64 processors), respectively. 
In the case of 64 processors for 2,097,152 atoms, the 
communication cost is about 30 % among whole elapse 
time. The cost of data communication increases with the 
number of the processors. The above analysis indicates 
that, when one uses a computer system of more than 200 
processors for 2,097,152 atoms, the data communication 
process would consume the majority of the elapse time. 

For a system of 2, 097, 152 atoms, the elapse time of 
one MD loop by one processor is 4565 s (76.1 min). When 
the time interval of one MD step corresponds to 3 femto 
seconds in physical systems, a one pico second simulation 
needs 422 hours (17.6 days) of CPU time by one proces- 
sor. Such simulation becomes feasible when we use 128 
processors. 

3.3 Calculations of elastic constants 

The lattice constant and elastic constants are calcu- 
lated in systems of several sizes of clusters of Si crystal of 



up to 1,423,909 atoms. The boundary condition is such 
that hybridized orbitals of the ideal sp 3 type are fixed 
on the surface atoms but surface atoms can move. Since 
the elastic constants are the linear response to small dis- 
tortions, they may be expected to be reproduced by the 
first order perturbation calculations. 

Figure 3 shows the calculated results with increasing 
the number of atoms N. The deviation can be scaled 
by iV -1 / 3 because it is the effects of the surface. The 
values extrapolated to TV — > oo arc summarized in Table 
I. The present results agree with those of the exact di- 
agonalization method 9 ) within less than 10% error. The 
errors except for C44 are not more than the difference 
between results by the present tight-binding Hamilto- 
nian (Diag.) and the LDA. The deviation of the shear 
modulus C44 is larger than those of the bulk modulus or 
Cn — C12, since C44 is inherently complicated due to the 
rchybridization and the internal distortion 11,12 ) and this 
phenomena cannot be described very accurately by the 
first order perturbation of the Wannier states of fixed 
sp 3 hybrids. The discrepancy between the results with 
the order-N method and those with the diagonalization 
method originates from the localization constraint for 
constructing the Wannier states and the perturbation 
treatment in Eqs. (3) and (4). These restrictions are 
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controllable and the discrepancy is not serious between 
the results by the present order-N method and the diag- 
onalization method. This is a typical example of account 
balancing between the accuracy and the computational 
cost in the order-N method. 

Much larger error is found in the value of C\± of the 
tight-binding calculation itself, compared with that of 
the LDA calculation, which is a limitation of the present 
tight-binding Hamiltonian. One of the important works 
in future is the construction of more accurate tight- 
binding Hamiltonian from the first principle electronic 
structure calculations. Even if such sophisticated Hamil- 
tonian is much complicated, it does not cause any essen- 
tial difficulty in the calculation by the present order-N 
method, though it may increase the CPU times. 

4. Conclusion 

In a summary, we demonstrated the efficiency of par- 
allelism of the perturbative ordcr-N method in the large- 
scale tight-binding MD simulations. The method was 
shown to be the most profitable procedure for the par- 
allel computation. The communication time is, even in 
the present case, much less than the time of the quantum 
mechanical calculations. 

The perturbative order-N method may be hardly ap- 
plied to systems with large distortion of lattices or bond 
breaking because the deviation from the unperturbed 
states becomes very large and, in the bond breaking pro- 
cess, the charge transfer and re-bonding are essentially 
important. In such cases we should combine the pertur- 
bative order-N method with other methods for construct- 
ing basis states. The variational method can be associ- 
ated with the perturbative ordcr-N method and we can 
compose a hybrid order-N method. The hybrid order-N 
method can give electronic structures in the whole sys- 
tem and, more importantly, there is no discontinuous 
boundary in the connected region. The hybrid scheme 
of the perturbative and variational order N-methods has 
been already applied to the fracture propagation in Si 



nanocrystals with 1.4 xlO 6 atoms without parallelism. 8 ' 
The parallelized hybrid ordcr-N method is a very essen- 
tial method to pursue tight-binding MD simulations for 
systems of millions atoms and the perturbative order-N 
method is the key technique in order to enlarge the size 
of the whole systems. 
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